Rollup functions for efficient storage, presentation, and analysis of data

ABSTRACT

Methods of organizing a series of sibling data entities in a digital computer are provided for preserving sibling ranking information associated with the sibling data entities and for attaching the sibling ranking information to a joint parent of the sibling data entities to facilitate on-demand generation of ranked parent candidates. A rollup function of the present invention builds a rollup matrix (126) that embodies information about the sibling entities and the sibling ranking information and provides a method for reading out the ranked parent candidates from the rollup matrix in order of their parent confidences (141). Parent confidences are based on the sibling ranking information, either alone or in combination with n-gram dictionary ranking or other ranking information.

RELATED APPLICATIONS

This application is a continuation of prior pending U.S. application Ser. No. 10/410,015 filed Apr. 8, 2003, which is a continuation of U.S. application Ser. No. 09/528,749 filed Mar. 20, 2000, now issued as U.S. Pat. No. 6,597,809, all of which claim priority to U.S. provisional application Nos. 60/125,352 filed Mar. 19, 1999 and 60/125,257 filed Mar. 19, 1999 and all are incorporated herein by this reference.

TECHNICAL FIELD

The present invention relates to computer-implemented methods and data structures for producing candidate parent entities that are ranked in accordance with ranking information associated with given child entities and, in particular, to such methods for use with software parsers and data dictionaries, for example, of the kind utilized in a system for automated reading, validation, and interpretation of hand print, machine print, and electronic data streams.

BACKGROUND OF THE INVENTION

Optical character recognition (OCR) systems and digital image processing systems are known for use in automatic forms processing. These systems deal with three kinds of data: physical data, textual data, and logical data. Physical data may be pixels on a page or positional information related to those pixels. In general, physical data is not in a form to be effectively used by a computer for information processing. Physical data by itself has neither useable content nor meaning. Textual data is data in textual form. It may have a physical location associated with it. It occurs in, for example, ASCII strings. It has content but no meaning. We know what textual data says, but not what it means. Logical data has both content and meaning. It often has a name for what it is.

For example, there may be a region of black pixels in a certain location on an image. Both the value of the pixels and their location are physical data. It may be determined that those pixels, when properly passed through a recognizer, say: “(425) 867-0700.” Content has been derived from the physical data to generate textual data. If we now know that text of this format (or possibly at this location on a preprinted form) is a telephone number, the textual data becomes logical data.

To facilitate reconciliation of imperfections in physical data and shortcomings of the recognition process, each recognized element of textual data, e.g., a character, may be represented by a ranked group of unique candidates called a “possibility set.” A possibility set includes one or more candidate information pairs, each including a “possibility” and an associated confidence. In the context of an OCR system, the confidence is typically assigned as part of the recognition process. For computational efficiency, the confidences may be assigned within an appropriate base-2 range, e.g., 0 to 255, or a more compact range, such as 0 to 7. For example, FIG. 1 shows an enlarged view of an individual glyph 20 that may be physically embodied as a handwritten character or as a digital pixel image of the handwritten character. From glyph 20, an optical character recognition process may generate the possibility set shown in TABLE 1 by assigning possibilities and associated confidences:

TABLE 1

possibility    confidence
c              200
e              123
o              100

FIG. 2 shows a series of sibling glyphs 22, which are known as “siblings” because they share the same parent word 24. The sibling glyphs 22 can be represented by the four possibility sets as shown in the following TABLE 2:

TABLE 2

poss  conf    poss  conf    poss  conf    poss  conf
c     200     h     190     o     100     r     125
o     150     n     100     a     80      n     100
e     100     r     80

The possibilities of these four possibility sets can be readily combined to form 36 unique strings: “chor”, “ohor”, “ehor”, “cnor”, “cror”, etc. The number of unique strings is determined by the product of the number of character possibilities in each possibility set, i.e., 3×3×2×2=36.

To gauge or verify their accuracy, the unique “candidate” strings may be processed by a “dictionary” of valid outcomes. In the context of OCR, a dictionary is a filter. It has content and rules. Each candidate string processed by the dictionary is subject to one of three possible outcomes: it is passed, it is rejected, or it is modified into a similar string that passes. One example of a dictionary is based on the English language. For parent word 24 of FIG. 2, the candidate strings “chor” and “ehar” would be rejected by such a dictionary, while “char” would be passed.

Because dictionaries often have a very large amount of content against which a candidate string is compared, it may be unduly time-consuming to apply the dictionary to all possible strings. To improve efficiency, it is desirable, before applying a dictionary, to rank the candidate strings in order of some confidence based on the accuracy of recognition. In this way the candidate strings having the highest confidence of having been accurately recognized are processed by the dictionary first. Rules can then be used to determine when to stop dictionary processing, e.g., when enough candidate strings have been processed to have isolated the best candidate strings (with a certain probability). A convenient way to rank candidate strings is to calculate string confidences based on the confidences of the component character possibilities that make up each candidate string. A set of candidate strings and their associated string confidences is referred to as an “alt-set.”

One way to rank parent candidates for creating an alt-set is to add the child confidences for each parent candidate. In the above example, “chor” would have a ranking of 615 (the sum of the confidences associated with the individual characters c-h-o-r), “ohor” would have a ranking of 565, “ehor” would have a ranking of 515, etc. Combining the possibility sets to form the 36 unique strings and to calculate their rankings is simple in this example. However, there is no obvious way to read the strings out in ranked order. The strings must first be assigned a ranking, then ordered or sorted based on their assigned rank. This ordering or sorting step becomes especially problematic for longer strings formed from sibling possibility sets having a greater number of possibilities. By way of illustration, a hypothetical 10-character parent word in which each child possibility set includes 10 possibilities would result in 10 billion unique strings. It would be a very time-consuming and computationally expensive task to rank and order 10 billion 10-character strings.
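By way of illustration only, the generate-then-sort approach described above may be sketched as follows, using the possibility sets of TABLE 2. The names POS_SETS and brute_force_alt_set are hypothetical and are not part of the described system. The sketch shows that every candidate string must be built and scored before the highest-ranked one can be read out.

```python
from itertools import product

# Possibility sets from TABLE 2: (possibility, confidence) pairs per glyph.
POS_SETS = [
    [("c", 200), ("o", 150), ("e", 100)],
    [("h", 190), ("n", 100), ("r", 80)],
    [("o", 100), ("a", 80)],
    [("r", 125), ("n", 100)],
]

def brute_force_alt_set(pos_sets):
    """Build every candidate string, score it by summing child confidences,
    then sort. All candidates exist in memory before the first is read."""
    scored = []
    for combo in product(*pos_sets):
        text = "".join(ch for ch, _ in combo)
        confidence = sum(conf for _, conf in combo)
        scored.append((text, confidence))
    # Sort by confidence, highest first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

alt_set = brute_force_alt_set(POS_SETS)
print(len(alt_set))   # 36
print(alt_set[0])     # ('chor', 615)
```

The number of tuples produced grows as the product of the child populations, which is the cost the remainder of the description seeks to avoid.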

Another known way of improving the efficiency of dictionaries is to use specialized dictionaries that contain smaller amounts of content than a more generalized dictionary but that are limited in their application. One such specialized dictionary is an “n-gram” dictionary, which includes information about the frequency with which certain character sequences (e.g., two-letter, three-letter, etc.) occur in the English language. For example, the two-letter combination “Qu” (a 2-gram) occurs in English words much more frequently than “Qo.” To benefit from an n-gram dictionary, the confidence assigned to an n-gram is some combination of (1) the aggregate character confidences and (2) the n-gram frequency provided by the n-gram dictionary. Thus, recognition may have produced Oueen and Queen where the first character has the possibility set: poss=O, conf=200; poss=Q, conf=100, but in the English language “Qu” happens much more often than “Ou”, so the 2-gram dictionary would help determine that Queen is the more likely parent string.
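The “Queen” example may be illustrated with a minimal sketch. The text does not prescribe how character confidence and n-gram frequency are combined, so simple addition is assumed here, and the 2-gram scores in BIGRAM_SCORE are invented placeholders rather than measured English frequencies. All names are hypothetical.

```python
# Character possibility set for the first glyph, taken from the text.
FIRST_CHAR = {"O": 200, "Q": 100}

# HYPOTHETICAL 2-gram scores on the same scale as the confidences;
# a real n-gram dictionary would supply measured frequencies.
BIGRAM_SCORE = {"Qu": 150, "Ou": 20}

def bigram_confidence(first, second, char_conf, bigram_score):
    """Combine recognizer confidence with n-gram frequency by addition
    (an assumed combination; other weightings are equally possible)."""
    return char_conf[first] + bigram_score.get(first + second, 0)

scores = {
    c + "ueen": bigram_confidence(c, "u", FIRST_CHAR, BIGRAM_SCORE)
    for c in FIRST_CHAR
}
best = max(scores, key=scores.get)
print(scores)  # {'Oueen': 220, 'Queen': 250}
print(best)    # Queen
```

Under these assumed scores the dictionary evidence outweighs the recognizer's preference for “O”.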

A need exists for a method of generating candidate strings in ranked order on an as-needed basis and, more generally, for a method of generating ranked parent candidates on an on-demand basis from a series of sibling possibilities. A need also exists for such a method that can be used with data at different logical levels in a logical data hierarchy, such as n-grams, words, and phrases.

SUMMARY OF THE INVENTION

In accordance with the present invention, methods of organizing a series of sibling data entities are provided for preserving sibling ranking information associated with the sibling data entities and for attaching the sibling ranking information to a joint parent of the sibling data entities to facilitate on-demand generation of ranked parent candidates. A rollup function of the present invention builds a rollup matrix containing information about the sibling entities and the sibling ranking information and provides a method for reading out the ranked parent candidates from the rollup matrix in order of their parent confidences, which are based on the sibling ranking information. Parent confidences may also be based, in part, on n-gram ranking or other ranking information.

External to the rollup function of the present invention, sibling entities are generated and passed to the rollup function for processing. Generation of a series of sibling entities may, in the context of OCR, involve optical scanning, recognition processing, and parsing. Each sibling entity comprises one or more ranked child possibilities, each having an associated child confidence. The number of child possibilities in a sibling entity is referred to as the “child population” of the sibling entity. Each sibling entity may include a range of child confidences, one of which is the maximum child confidence.

In one aspect of the invention the rollup function is implemented in computer software operable on a digital computer. The rollup matrix is modeled as a three-dimensional data array called a rollup table. The rollup table serves as a convenient visual aid to understanding the nature of the rollup matrix and operation of the rollup function. It should be understood that nothing in the foregoing description of the rollup table should be construed as limiting the scope of the invention to implementation of the rollup matrix in data arrays. Other data structures, such as linked lists, are also suitable for implementing the rollup function of the present invention. It should be understood, therefore, that the term “rollup matrix” as used herein shall mean data tables, linked lists, and any other device for defining relationships between nodes in a data structure, where such nodes include one or more elements of data and one or more relationships to other nodes, procedures, or nested rollup functions. Furthermore, it will be apparent from the foregoing description of the invention that while the invention is suitable for use with OCR technology, it is also suitable for use with processing of other types of content-bearing data in which uncertainty in the data content is sought to be resolved. Non-OCR applications of the invention involving resolution of empirical uncertainty may include, for example, bioinformatics systems for analyzing gene sequencing information.

After receiving a series of sibling data entities, a matrix initialization routine of the rollup function establishes a rollup table and sizes it based on properties of the sibling entities. In particular, the rollup table is sized to include a series of “columns” equal in number to the number of sibling entities received. The dimension of the rollup table spanned by the columns is referred to as the “width” of the table. The rollup table is sized in a “height” dimension based on a number of “rows,” with each having a row position indicating its position along the height dimension of the data table. The number of rows, and consequently the height of the table, is based on the sum of the maximum child confidences of the sibling entities. In practice, the number of rows may be established as equal to the sum of the maximum child confidences plus one. The rollup table is sized in a “depth” dimension based on the largest of the child populations of the sibling entities. The rollup table is a collection of “nodes,” each located in the rollup table at a position defined by column, row position, and a depth position in the depth dimension.

Once the rollup function has established the rollup table, a loading routine of the rollup function then loads the sibling entities into the rollup table in a predetermined loading sequence beginning with loading a first sibling entity in a first column of the series of columns. Each sibling entity is loaded in sequence, from the first sibling entity to the last sibling entity in the series. If the sibling entities have no serial relationship, then an arbitrary, but ordered sequence of loading is chosen. Each child possibility of the first sibling entity is loaded into a node of the rollup table located at the first column and at the row having a row position corresponding to the child confidence of the child possibility being loaded. The rollup function then proceeds to load the second sibling entity in the series in a second column. For the second and each subsequent sibling entity and column, the rollup function loads each child possibility in one row of the current column for each row of the immediately preceding column having a filled node. The child possibilities of the second sibling entity are loaded in rows of the second column that have row positions offset from the row positions of filled nodes of the immediately preceding column (i.e., the first column) by an offset amount corresponding to the child confidence of the child possibility being loaded in the second column. The child possibilities of the third sibling entity are loaded in rows of the third column having row positions offset from the row positions of filled nodes of the second column by an offset amount corresponding to the child confidence of the child possibility being loaded in the third column, and so on, until the last sibling entity has been loaded in the last column of the rollup table. Each entry in the last column of the rollup table is a terminal element. Due to different confidence values that may be associated with multiple child possibilities of each of the sibling entities, the loading sequence may result in the loading of multiple elements in a particular column and row position. During loading, if a node has already been filled with a child possibility, the loading routine offsets in the depth of the rollup table until it reaches an unoccupied node, then fills that node.

Upon completion of the loading sequence, another aspect of the invention involves a roll-out routine of the rollup function, which may be used to read parent candidates from the rollup table according to their parent confidences. The reading of parent candidates, known as “roll-out,” begins with a terminal element known as an entry point. Each parent candidate is assembled in a sequence opposite the sequence in which the rollup table was loaded, as follows: After reading a terminal element from the last column, the roll-out routine then reads a next-to-last element from the node located at a next-to-last column immediately preceding the last column and at a row position less than the row position of the entry point by an amount equal to the child confidence associated with the terminal element. The next-to-last element is then prepended to the terminal element to form a string tail. A prefix element is read from a node located in the column immediately preceding the next-to-last column and at a row position less than the node of the next-to-last element by an amount equal to the confidence of the next-to-last element. The prefix element is then prepended to the string tail. If the sibling entities forming the rollup table have no serial relationship, then prepending involves combining the elements in reverse order of their loading in the rollup table. This reading process is repeated until the roll-out routine reaches the first column, completing roll-out of the parent candidate. If more than one element is located at a particular column and row location (i.e., elements are stored at more than one depth position), then the roll-out routine will continue reading parent candidates beginning from the same entry point until elements at all occupied nodes at all depths in the appropriate columns and rows have been read and all parent candidates having the same parent confidence have been rolled out, or until the desired number of parent candidates have been rolled out. The roll-out process is merely repeated for further parent candidates.

The method of loading the data table dictates that each row position corresponds to the parent rank of each parent candidate assembled from a terminal element located at that row position. The parent candidate (or candidates) with the greatest parent confidence may be read from the rollup matrix by beginning at a maximal node located at the last column and at the row of greatest row position. Consequently, parent candidates may be read in decreasing order of parent rank by merely assembling parent candidates in sequence, beginning with terminal element(s) located at the maximal node and continuing to read from the rollup table at entry points of decreasing row position until all parent candidates have been assembled. The process of building a rollup matrix and rolling-out parent candidates to form alt-sets can be repeated at each level in the data hierarchy. If desired, rollup functions can be nested by storing a nested “child” rollup function pointer at a node of a parent roll-up table.

Given the foregoing description of the invention, the use of software counters to facilitate the loading of the rollup matrix and the roll-out of parent candidates will be understood by those skilled in the art.

In another aspect of the invention, the rollup matrix is established in a computer memory using a plurality of memory pointers in place of the 3-dimensional data array of the rollup table. In this aspect of the invention, the terms “rows” and “columns” are arbitrary but are used herein to denote memory locations within the rollup matrix. In reality, each node of the rollup matrix includes a pointer to other nodes which contain a child possibility of an adjacent sibling entity. If a node must point to more than one child possibility, as in the case of multiple child possibilities at a particular column and row position, the node will include multiple pointers. When these multi-pointer nodes are encountered by the roll-out routine, a branch is indicated so that all pointers of each node are followed before moving to the next entry point.

Nodes occupying entry points shall be referred to as “entry nodes.” Entry nodes further include a parent confidence which the roll-out routine recognizes as assigned to the parent candidate assembled beginning with the entry node. Entry nodes may also include a pointer to the next entry node in the matrix, which may have the same parent confidence or a lesser parent confidence. Nodes in the “first column,” loaded with a child possibility of the first possibility set, may include a return pointer that may direct the roll-out routine to output the completed parent candidate for verification (e.g., using a dictionary) or to proceed to the next entry node for generation of the next parent candidate. Nodes at any location in the rollup matrix may also include a pointer to an entry node of a nested rollup matrix.
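The pointer-based aspect may be sketched, under stated assumptions, as follows. The class Node and the functions build_linked_matrix, follow, and roll_out_linked are hypothetical names chosen for illustration; the next-entry pointers and return pointers described above are replaced here by an ordinary dictionary keyed on parent confidence, which is a simplification rather than the described structure. The sample pos-sets are those of TABLE 3 of the detailed description.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    possibility: str
    confidence: int
    prev: list = field(default_factory=list)  # pointers into the preceding "column"

def build_linked_matrix(pos_sets):
    """Return entry nodes as {parent_confidence: [Node, ...]}.
    'Rows' exist only as dictionary keys used while linking."""
    rows = {}
    for index, pos_set in enumerate(pos_sets):
        new_rows = {}
        for poss, conf in pos_set:
            if index == 0:
                new_rows.setdefault(conf, []).append(Node(poss, conf))
            else:
                for row, nodes in rows.items():
                    # One node per filled row; it points at every node in that row.
                    new_rows.setdefault(row + conf, []).append(
                        Node(poss, conf, list(nodes)))
        rows = new_rows
    return rows

def follow(node, tail=""):
    """Follow every pointer from a node back to the first column."""
    text = node.possibility + tail
    if not node.prev:
        yield text
    else:
        for earlier in node.prev:   # a multi-pointer node is a branch
            yield from follow(earlier, text)

def roll_out_linked(entry_nodes):
    """Visit entry nodes in decreasing parent confidence."""
    for confidence in sorted(entry_nodes, reverse=True):
        for entry in entry_nodes[confidence]:
            for text in follow(entry):
                yield text, confidence

SAMPLE = [[("a", 2), ("o", 1)], [("n", 1), ("u", 0)], [("t", 1)], [("s", 1), ("5", 0)]]
linked = list(roll_out_linked(build_linked_matrix(SAMPLE)))
print(linked[0])    # ('ants', 5)
print(len(linked))  # 8
```

Each node stores only pointers to predecessors, so no candidate string is materialized until it is rolled out.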

In yet another aspect of the invention, n-gram possibility sets are generated using an n-gram rollup function in accordance with the present invention. Comparison of parent candidate n-grams against an n-gram dictionary allows n-gram candidates to be weighted in accordance with their relative frequencies of occurrence in the context of, for example, the English language. Possibility sets including n-grams are readily accommodated in establishing the rollup matrix. For 3-grams, the nodes are loaded with the 3-grams at a row position which is the aggregate of the confidence of the central character (of the 3-gram) and the dictionary-provided frequency of the 3-gram. In this aspect of the invention, child possibilities in the first and last columns of the rollup matrix must be prepended and appended, respectively, with nulls (or spaces) so that all child possibilities are 3-grams. Further, the 3-gram child possibilities must be loaded in the rollup matrix so that when the parent candidates are rolled-out, all adjacent 3-grams assembled in a parent candidate share two characters. For example, “out” in the first column will fit with “uts” in the second column, but not with “nts.”
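The two-character overlap requirement for adjacent 3-grams may be expressed as a short check. The helper names pad_to_trigrams and fits are hypothetical, and the use of a space as the padding character is one of the two options the text allows.

```python
def pad_to_trigrams(word):
    """Split a word into 3-grams centered on each character, padding the
    ends with spaces so the first and last characters also get a 3-gram."""
    padded = " " + word + " "
    return [padded[i:i + 3] for i in range(len(word))]

def fits(left, right):
    """Adjacent 3-grams are compatible when they share two characters."""
    return left[1:] == right[:2]

trigrams = pad_to_trigrams("outs")
print(trigrams)            # [' ou', 'out', 'uts', 'ts ']
print(fits("out", "uts"))  # True
print(fits("out", "nts"))  # False
```

A loading routine for 3-grams could apply such a check before linking a child possibility to a filled node of the preceding column.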

In the context of OCR, the rollup function of the present invention is useful at every level of textual hierarchy. Rollup functions also avoid fatal problems often encountered by prior art string generators, which create strings from a series of possibility sets. Existing string generators suffer from three major problems. First, they are combinatorically expensive in memory use, needing a place in memory for each possible string. Second, string generators must trim strings before generating all possible strings because of limited space to store the combinatorically-many strings. Therefore, it is possible for string generators to result in higher-confidence strings being abandoned while lower-confidence strings are preserved. Third, string generators do not guarantee that strings of the same confidence, once ordered, retain that order.

The present invention gets around all these problems in a natural way. First, the rollup function is only geometrically expensive of memory, not combinatorically. Tables generated by prior art systems grow as L×n^L, where n is the number of possibilities per possibility set and L is the number of possibility sets (i.e., the string length). There are n^L strings of length L that can be generated. By comparison, the rollup matrix of the present invention grows as 2×n×CF_max×L^2, where CF_max is the highest confidence value in any possibility set, a significant savings over prior art systems. For L=10, n=3, and CF_max=20, and allowing 1 byte per ASCII character, approximately 590,490 bytes would be required for ranking tables of prior art systems, while only 12,000 bytes are required for the rollup matrix, a savings of 98%. Second, candidate strings can be read out of a rollup table in their decreasing order of confidence without having to store unneeded strings in memory, while never skipping a higher-confidence parent candidate for a lower confidence one. The rollup matrix does not change size with the number of generated strings. Therefore, all strings are preserved and there is no trimming of strings ever required. Third, no reordering of parent strings ever takes place because the rollup matrix is unchanging. Consequently, strings of the same confidence remain in their original order.
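The stated figures may be checked with a few lines of arithmetic. The rollup-matrix expression used below includes a factor n (one node per tier), which is an assumption made here because it is the form that reproduces the stated 12,000 bytes; the variable names are illustrative only.

```python
L = 10        # string length (number of possibility sets)
n = 3         # possibilities per possibility set
CF_MAX = 20   # highest confidence value in any possibility set

prior_art_bytes = L * n**L            # every string of length L, 1 byte per character
rollup_bytes = 2 * n * CF_MAX * L**2  # ASSUMED form; reproduces the stated 12,000 bytes
savings = 1 - rollup_bytes / prior_art_bytes

print(prior_art_bytes)       # 590490
print(rollup_bytes)          # 12000
print(round(savings * 100))  # 98
```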

Parent candidates can be read from the rollup matrix in decreasing or increasing order of parent confidence. First, a parent candidate having a desired confidence value can easily be selected from the matrix by a confidence stored in association with an entry node of the parent candidate. Parent candidates having lesser (or greater) confidences can then be read until a desired lesser (or greater) confidence level is reached. This process can be repeated until a predetermined number of parent candidates have been obtained or until all possible parent candidates have been rolled-out. The rollup function can be interrupted while reading out a parent candidate to handle some other process, such as verifying the most recently rolled-out parent candidate using a dictionary. The rollup function easily returns to where it left off in the rollup matrix to read out the next-ranked parent candidate by returning to the location in the rollup matrix that was being accessed when the interruption occurred. The rollup function of the present invention provides the above-described benefits without requiring the production of all of the parent candidates before subsequent ranking. If a particular child possibility occurs with at most one confidence value in a possibility set, then the last rolled-out string is the pointer structure. Even in the case of allowed duplication, returning to the rollup function is as simple as storing a pointer to the next entry point in the rollup matrix and storing a pointer to each position of the table, which may be accomplished by freezing the internal pointer structure.

The rollup function of the present invention is, of course, not limited to strings. Any parent entity can receive rollup-produced alt-sets from its child entities. For example, gene sequence information prepared from a human, an animal, a plant, or any other living organism may be parsed into its nucleotides, each of which may be represented by an alt-set. Sibling nucleotide alt-sets can then be loaded into a rollup matrix for the parent gene. In this way, the frequency of naturally-occurring nucleotide and coding sequence variations can easily be represented by the child confidences associated with child possibilities of each alt-set. Inaccuracies inherent in the gene sequencing process can be similarly represented by the child confidences.

Additional aspects and advantages of this invention will be apparent from the following detailed description of preferred embodiments thereof, which proceeds with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an enlarged view of a hand printed glyph;

FIG. 2 is an enlarged view of a series of sibling glyphs;

FIG. 3 is a flow diagram depicting an OCR process for scanning, parsing, and recognizing handwritten data to create possibility sets for use with a data verification routine of the present invention;

FIG. 4 is a flow diagram showing detail of the data verification routine of FIG. 3 including a rollup function and dictionary routine in accordance with a preferred embodiment of the present invention;

FIG. 5 is a pictorial view of a three-dimensional data array in accordance with a first preferred embodiment of the present invention;

FIGS. 6A, 6B, 6C, and 6D are two-dimensional pictorial views of a rollup matrix in accordance with the present invention showing a loading sequence for loading the alt-sets of Table 3 into the rollup matrix;

FIG. 7 is an exploded three-dimensional view of the loaded rollup matrix of FIG. 6D;

FIGS. 8A, 8B, 8C, and 8D show a sequence of rolling out a parent candidate from the loaded rollup matrix of FIG. 6D;

FIG. 9 is a diagram of an alternative embodiment of the rollup matrix of FIG. 6D including a linked list implemented in a computer memory;

FIG. 10 is a flow diagram showing steps taken in preparation and validation of n-gram alt-sets for loading in a rollup matrix for a parent string of the n-grams;

FIG. 11 is a two-dimensional pictorial view showing nested rollup matrices;

FIG. 12 is a flow diagram showing steps for establishing and loading of the nested rollup matrices of FIG. 11; and

FIG. 13 is a flow diagram showing parent candidates being rolled out from the nested rollup matrices of FIG. 11.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

FIG. 3 is a flow diagram of an OCR process 30 in accordance with a first preferred embodiment of the present invention. With reference to FIG. 3, a document 32 bearing physical textual data is scanned using an optical scanner 34, which produces a digital pixel image of the physical data on document 32. A segmentation process 36 of the OCR process 30 receives the pixel image from the optical scanner and segments the pixel image into data segments for processing by a recognizer 38. Recognizer 38 analyzes the data segments to produce a possibility set (“pos-set”) for each data segment. Empirical uncertainty in the physical data and inaccuracies of the scanning, segmentation and recognition process are represented in the pos-sets by including multiple child possibilities in each pos-set and by assigning child confidences to the child possibilities. For example, recognizer 38 separates a parent string (as in the parent word 24 of FIG. 2) into its sibling glyphs and outputs a pos-set for each glyph. The pos-sets are output to a data verification routine 40, which uses a rollup function 60 (FIG. 4) and possibly one or more dictionaries 150 (FIG. 4) in accordance with the present invention.

FIG. 4 is a flow diagram of rollup function 60 of data verification routine 40 (FIG. 3). With reference to FIG. 4, a matrix initialization routine 62 of rollup function 60 receives pos-sets 64 from recognizer 38. FIG. 5 is a pictorial view of a three-dimensional data array 66, which represents a data matrix in accordance with the present invention. Data array 66 includes rows 70, columns 72, and tiers 74 that together form nodes 76. With reference to FIGS. 4 and 5, matrix initialization routine 62 establishes a size of data array 66 based on pos-sets 64. For purposes of a simple illustration, TABLE 3 presents four sibling pos-sets.

TABLE 3

poss  conf    poss  conf    poss  conf    poss  conf
a     2       n     1       t     1       s     1
o     1       u     0                     5     0

A first pos-set shown in TABLE 3 includes two child possibilities, “a” and “o”, which are assigned child confidences 2 and 1, respectively. A second pos-set includes child possibilities “n” and “u”, having associated child confidences 1 and 0, respectively. And so on. The matrix initialization routine calculates a sum of the maximum confidences of the four pos-sets (2+1+1+1=5) and adds one (5+1=6) to establish a height 80 of data array 66. Data array 66, thus, includes six rows 70, having row heights R0, R1, R2, R3, R4, and R5. A width 82 of data array 66 is equal to the number of pos-sets 64. A depth 84 of data array 66 is equal to the largest number of child possibilities in any of the pos-sets 64. In this example, three of the pos-sets are equally large, having two child possibilities.
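The sizing rule just described may be sketched as follows for the pos-sets of TABLE 3. The function name size_rollup_table and the list-of-pairs representation are assumptions made for illustration.

```python
# Pos-sets of TABLE 3 as (possibility, confidence) pairs.
TABLE_3 = [
    [("a", 2), ("o", 1)],
    [("n", 1), ("u", 0)],
    [("t", 1)],
    [("s", 1), ("5", 0)],
]

def size_rollup_table(pos_sets):
    """Width = number of pos-sets; height = sum of maximum child
    confidences plus one; depth = largest child population."""
    width = len(pos_sets)
    height = sum(max(conf for _, conf in ps) for ps in pos_sets) + 1
    depth = max(len(ps) for ps in pos_sets)
    return width, height, depth

width, height, depth = size_rollup_table(TABLE_3)
print(width, height, depth)  # 4 6 2
```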

Once data array 66 has been established and sized, a loading routine 90 of rollup function 60 loads pos-sets 64 into data array 66. FIGS. 6A, 6B, 6C, and 6D depict a loading sequence followed by loading routine 90. With reference to FIG. 6A, a data table 92 provides a two-dimensional representation of the three-dimensional data array 66 of FIG. 5, including four columns C1, C2, C3, and C4, each of which is divided by broken lines to indicate tiers 74 of data array 66 (FIG. 5). Loading routine 90 loads the child possibilities 94 of the first pos-set into the first column C1 so that each child possibility 94 is loaded in a node 96 at a row position equal to the child confidence 98 corresponding to the child possibility 94. Thus, child possibility “o”, which has an associated child confidence of one, is loaded at the node located at row R1, and child possibility “a” is loaded at row R2 because it has an associated child confidence of two.

When loading routine 90 completes loading of the first pos-set (FIG. 6A), it proceeds to load the second pos-set into data table 92. With reference to FIG. 6B, each child possibility of the second pos-set is loaded in one node 96 of the second column (C2) for each row of the first column (C1) having filled nodes, but at a row height greater than the row height of the filled nodes 96 of column C1 by an amount equal to the child confidences being loaded. Thus, child possibility “u” having a child confidence of zero is loaded in nodes located at rows R1 and R2 of column C2, since rows R1 and R2 are filled in column C1. Child possibility “n” is loaded in nodes located at rows R2 and R3 of column C2, which are greater than the row positions of the filled nodes (R1 and R2) of column C1 by an amount equal to the child confidence (one) associated with child possibility “n.” Because the node located at C2, R2, T0, is already filled with child possibility “u”, loading routine 90 loads child possibility “n” at node C2, R2, T1 so that no more than one child possibility is loaded in each node.

Loading routine 90 then continues to load successive pos-sets 64 in sequence in successive columns, as depicted in FIGS. 6C and 6D, until all pos-sets 64 have been loaded in data table 92. As in column C2, child possibilities 94 are loaded in nodes 96 located at row positions that are greater (by an amount equal to the child confidence of the child possibility being loaded) than the row position(s) of rows of the immediately preceding column that have filled nodes. Nodes of the last column (C4) that are loaded with child possibilities contain data entities that are known as terminal elements 100.
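By way of illustration only, the loading sequence of FIGS. 6A through 6D may be sketched in Python as follows. The sketch is not the disclosed implementation. The first two pos-sets follow the description above, while the contents of the third and fourth pos-sets, the names `POS_SETS` and `load_table`, and the use of a base row of zero before the first column are assumptions made for the example. Each column is held as a mapping from row position to a list whose positions stand in for tiers T0, T1, and so on.

```python
from collections import defaultdict

# Pos-sets as (child possibility, child confidence) pairs.
# The first two follow the text; the last two are assumed for illustration.
POS_SETS = [
    [("o", 1), ("a", 2)],
    [("u", 0), ("n", 1)],
    [("t", 1)],
    [("s", 1), ("5", 0)],
]


def load_table(pos_sets):
    """Return one dict per column mapping row position -> list of tiers."""
    columns = []
    filled_rows = {0}  # assumed base row preceding the first column
    for pos_set in pos_sets:
        column = defaultdict(list)
        for child, confidence in pos_set:
            # Load the child once for every filled row of the previous
            # column, raised by the child's own confidence.
            for row in sorted(filled_rows):
                # A node holds one child; a second child arriving at the
                # same column and row lands in the next tier of the list.
                column[row + confidence].append(child)
        columns.append(dict(column))
        filled_rows = set(column)
    return columns


table = load_table(POS_SETS)
for index, column in enumerate(table, start=1):
    print(f"C{index}:", dict(sorted(column.items())))
```

In this sketch the second column holds “u” at tier T0 and “n” at tier T1 of row R2, matching the node assignment described for FIG. 6B.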

FIG. 7 is an exploded view of the loaded data table 92 of FIG. 6D showing its loaded data in a three-dimensional representation in accordance with three-dimensional data array 66 of FIG. 5.

To extract parent candidate strings from data table 92, a roll-out routine 110 of rollup function 60 is provided (FIG. 4). FIG. 8A depicts the steps taken by roll-out routine 110 in rolling out parent candidate “ants”, i.e., the parent candidate comprising the sibling characters “a”, “n”, “t”, and “s”. Parent candidate “ants” has the greatest aggregate confidence of any of the parent candidates because its terminal element (“s”) 100 is located in the row of data table 92 having the greatest row position (R5), i.e., it is a maximal terminal element 112. With reference to FIG. 8A, roll-out routine 110 reads from columns C4, C3, C2, and C1, in the order opposite to that in which the columns were loaded. Terminal element “s” 100 (which is also the maximal terminal element 112) is read initially. Next, roll-out routine 110 reads next-to-last child element “t” 116 from the immediately previous column (C3) and from row R4, which has a row position less than the row position of terminal element “s” by the amount of the child confidence associated with terminal element “s” (i.e., one). Roll-out routine 110 prepends next-to-last child element “t” to the terminal element “s” to form a string tail of “ts.” The child confidence of one associated with next-to-last child element “t” 116 then directs roll-out routine 110 to read prefix element “n” 118 from row R3, column C2 (because row R3 has a row position one less than the row position of R4). Roll-out routine 110 prepends prefix element “n” 118 to the string tail “ts” to form the partial string “nts.” Element “a” 120 is then read because it is loaded in row R2, which is one less (the child confidence associated with prefix element “n” 118) than the row position of prefix element “n” 118. Element “a” 120 is prepended to complete the formation of candidate parent string “ants”. The parent confidence associated with “ants” is equal to five, which is the row position of the terminal element 100 a used to extract “ants”.

FIG. 8B depicts the steps taken by roll-out routine 110 in rolling out parent candidate “ant5”. With reference to FIG. 8B, terminal element “5” has an associated child confidence of zero, which directs roll-out routine 110 to read next-to-last element “t” from the same row position (R4) in column C3. The parent confidence associated with “ant5” is equal to four, which is the row position of terminal element “5” 100 b used to extract “ant5”.

FIGS. 8C and 8D depict the steps taken by roll-out routine 110 in rolling out respective parent candidates “auts” and “onts.” Because there are two entries in row R2, column C2, roll-out routine 110 rolls out two unique parent candidates ending with terminal element “s” 100 c, both having an associated parent confidence of four, which is equal to the row height of row R4, where terminal element “s” 100 c is located.
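The roll-out sequences of FIGS. 8A through 8D may be sketched in Python as follows, purely as an illustration. The names `TABLE`, `CONFIDENCE`, `roll_out_from`, and `roll_out_ranked` are hypothetical, and the contents of columns C3 and C4 are assumed for the example; only columns C1 and C2 are taken from the description. Starting at a terminal element, the sketch steps back one column at a time, lowering the row position by the child confidence of the element just read, and branches wherever a row of the preceding column holds more than one tier.

```python
# Loaded table: one dict per column mapping row position -> children (tiers).
# Columns C3 and C4 are assumed for illustration; C1 and C2 follow the text.
TABLE = [
    {1: ["o"], 2: ["a"]},
    {1: ["u"], 2: ["u", "n"], 3: ["n"]},
    {2: ["t"], 3: ["t"], 4: ["t"]},
    {2: ["5"], 3: ["s", "5"], 4: ["s", "5"], 5: ["s"]},
]

# Child confidence of each child possibility, per column.
CONFIDENCE = [
    {"o": 1, "a": 2},
    {"u": 0, "n": 1},
    {"t": 1},
    {"s": 1, "5": 0},
]


def roll_out_from(col, row, tail=""):
    """Yield every parent candidate whose path passes through (col, row)."""
    for child in TABLE[col].get(row, []):
        candidate = child + tail  # prepend the element just read
        if col == 0:
            yield candidate
        else:
            # Drop by this child's confidence to find the row to read next.
            previous_row = row - CONFIDENCE[col][child]
            yield from roll_out_from(col - 1, previous_row, candidate)


def roll_out_ranked():
    """Yield (candidate, parent confidence) from the highest terminal row down."""
    last = len(TABLE) - 1
    for row in sorted(TABLE[last], reverse=True):
        for candidate in roll_out_from(last, row):
            yield candidate, row


ranked = list(roll_out_ranked())
for candidate, confidence in ranked:
    print(confidence, candidate)
```

Under these assumptions the first candidate produced is “ants” with a parent confidence of five, followed by the three candidates of confidence four (“auts”, “onts”, and “ant5”).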

In accordance with an alternative embodiment of the present invention, FIG. 9 shows the loaded data table 92 of FIGS. 6D and 7 embodied as a linked-list rollup matrix 126. With reference to FIG. 9, rollup matrix 126 includes a pointer structure 128 to nodes 96. To roll out the parent candidate “ants”, roll-out routine 110 starts at an initial entry point 130 that includes terminal element 100 a (element “s” of maximal terminal element 112). Roll-out routine 110 then reads out elements “t” 116, “n” 118, and “a” 120 by following respective pointers 134, 136, and 138 and prepends them to element “s” 100 a. A return pointer 140 indicates to roll-out routine 110 that it has completed construction of the parent candidate. A parent confidence 141 of the parent candidate “ants” is stored in association with the terminal element “s” 100 a. All terminal elements of rollup matrix 126 serve as entry points 142 for rolling out one or more parent candidates. As in the roll-out sequences shown in FIGS. 8C and 8D, two parent candidates can be rolled out of rollup matrix 126 by beginning with terminal element “s” 100 c. A branch node 144 of rollup matrix 126 includes two pointers 146, 148, which indicate to roll-out routine 110 that two different parent candidates use branch node 144 and that roll-out routine 110 needs to execute a branch at branch node 144. Those skilled in the art will understand that more than one branch node may exist in rollup matrix 126, and that some branch nodes will have more than two pointers (if the matrix is “deeper” than two tiers).
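A linked-list arrangement of this kind may be sketched in Python as follows. This is an illustrative sketch only: the `Node` class, its field names, and the use of an empty pointer list to stand in for return pointer 140 are assumptions, and the sketch assumes that the branch occurs at the “t” node reached from terminal element “s” 100 c. Only the portion of the matrix needed to roll out “auts” and “onts” is constructed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    element: str
    # Pointers toward the first column; an empty list stands in for the
    # return pointer that marks completion of a parent candidate.
    pointers: List["Node"] = field(default_factory=list)
    # Parent confidence, stored only at terminal elements (entry points).
    parent_confidence: Optional[int] = None


def roll_out(node, tail=""):
    """Return every parent candidate reachable from the given node."""
    candidate = node.element + tail  # prepend the element just read
    if not node.pointers:
        return [candidate]
    results = []
    for pointer in node.pointers:  # more than one pointer means a branch
        results.extend(roll_out(pointer, candidate))
    return results


# Portion of the matrix reachable from the terminal "s" in row R4.
a = Node("a")
o = Node("o")
u = Node("u", [a])
n = Node("n", [o])
t = Node("t", [u, n])  # branch node with two pointers
s = Node("s", [t], parent_confidence=4)

candidates = roll_out(s)
print(s.parent_confidence, candidates)
```

Because the parent confidence is stored with the entry point, candidates can be ranked without recomputing any sum of child confidences during roll-out.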

After rolling out each parent candidate (typically in decreasing order of parent confidence), rollup function 60 may output each parent candidate to a dictionary routine 150 (FIG. 4) for validation using an appropriate parser and dictionary. One embodiment of handling dictionary processing is shown in FIG. 4, and includes conditional iteration of roll-out routine 110. An iteration step 154 is conditional upon whether the parent candidate output by roll-out routine 110 passes the dictionary test (160) and, if it does, whether some other stop limit 170 has been met. For example, stop limit 170 may trigger OCR process 30 (FIG. 3) to terminate verification of the parent element represented by rollup matrix 126 (and data table 92), and to load the next series of pos-sets scanned and recognized from document 32.
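The conditional iteration described above may be sketched in Python as follows, as one possible reading rather than the disclosed implementation. The function name `validate`, the parameter `max_accepted` (used here as a simple stand-in for stop limit 170), the word set `WORDS`, and the ranked list `RANKED` are all hypothetical and chosen only for illustration.

```python
def validate(ranked_candidates, dictionary, max_accepted=1):
    """Iterate over ranked candidates until the assumed stop limit is met."""
    accepted = []
    for candidate, confidence in ranked_candidates:
        if candidate not in dictionary:
            continue  # failed the dictionary test: roll out the next one
        accepted.append((candidate, confidence))
        if len(accepted) >= max_accepted:
            break  # stop limit met: no further roll-out is needed
    return accepted


# Illustrative ranked output of a roll-out routine (assumed values).
RANKED = [("ants", 5), ("auts", 4), ("onts", 4), ("ant5", 4), ("outs", 3)]
WORDS = {"ants", "outs"}  # hypothetical dictionary contents

best = validate(RANKED, WORDS)
top_two = validate(RANKED, WORDS, max_accepted=2)
print(best, top_two)
```

Because candidates arrive in decreasing order of parent confidence, the loop can stop as soon as the limit is met without rolling out the remaining candidates.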

FIG. 10 is a flow diagram showing steps taken in preparation and validation of n-gram alt-sets for loading in a rollup matrix for a parent string of the n-grams. With reference to FIG. 10, an n-gram verification process 200 receives pos-sets from the OCR system (step 210) and assembles them in computer memory to form a ranked list of n-gram candidates (step 212). N-gram candidates within a single ranked list may have different lengths, for example when one of the pos-sets includes both an “m” possibility and an “rn” possibility. To accommodate n-gram candidates having different lengths, a length gage routine 214 of n-gram verification process 200 determines the length of each n-gram candidate. The n-gram candidates are then processed by an appropriate n-gram dictionary 216. N-gram dictionary 216 is a specialized dictionary or collection of specialized dictionaries that includes information about frequency of occurrence of n-grams (for example 2-grams, 3-grams, etc.) in written language or some subset of written language. N-gram dictionary 216 assigns an n-gram confidence to each n-gram candidate based on (i) the dictionary frequency rating for the n-gram and (ii) a child confidence associated with a central character of the n-gram candidate. Each n-gram candidate and its associated n-gram confidence are then appended to an n-gram alt-set (step 218). Steps 214, 216, and 218 are then repeated until all of the lists of n-gram parent candidates have been processed through the dictionary and output as n-gram alt-sets. After all n-gram alt-sets have been completed, a string-sized rollup matrix is built using the alt-sets as sibling entities (step 220). Parent string candidates can then be rolled out of the string-sized rollup matrix in ranked order (step 222) and processed using a string dictionary (step 224) before outputting ranked parent strings (step 226).

FIG. 11 is a two-dimensional pictorial view showing nested rollup matrices 240 established in accordance with the present invention. With reference to FIG. 11, nested rollup matrices 240 include a child rollup matrix 250 nested within a parent rollup matrix 260. Child rollup matrix 250 is said to be “nested” because complete candidates that may be rolled out of child rollup matrix 250 are referenced by pointers within parent rollup matrix 260. In this example, child rollup matrix 250 represents candidate city names in a typical rollup matrix in accordance with the present invention. However, any child entity can be represented in a nested child rollup matrix. Parent rollup matrix 260 is a typical rollup matrix in accordance with the present invention. In this example, parent rollup matrix 260 includes sibling city, state, and zip-code alt-sets. First and second city nodes 262, 264 of parent rollup matrix 260 include respective first and second city pointers 266, 268 to respective first and second entry points 270, 272 of child rollup matrix 250. First and second entry points 270, 272 are terminal nodes of child rollup matrix 250 having associated city confidences 274, 276. While the nested rollup matrices 240 of FIG. 11 include only one nested child matrix, it would be straightforward to nest multiple child matrices within a single parent rollup matrix. Likewise, it would be simple to create a hierarchy of nested rollup matrices including three or more layers of rollup matrices, rather than the two layers (child rollup matrix 250 and parent rollup matrix 260) of FIG. 11.

In setting up nested rollup matrices 240, child rollup matrix 250 is established before establishing parent rollup matrix 260. This order of establishing nested rollup matrices 240 ensures that city confidences 274, 276 of child rollup matrix 250 may be taken into account when establishing, sizing, and loading parent rollup matrix 260. When loading first and second city pointers 266, 268 in parent rollup matrix 260, city confidences 274, 276 of child rollup matrix 250 determine how parent rollup matrix 260 is loaded.

FIG. 12 is a flow diagram showing steps for establishing and loading of the nested rollup matrices of FIG. 11. With reference to FIG. 12, a child rollup matrix is first established and loaded (step 300). Once loaded, entry points for child candidates of the child rollup matrix, and their associated child confidences, are available. These child candidates, entry points, and child confidences are then taken into account in establishing and sizing the parent rollup matrix (step 310). The parent rollup matrix is then loaded (step 320). In the example of FIG. 11, parent rollup matrix 260 is loaded with a zip-code (postal code) alt-set in its terminal column and a state alt-set in its next-to-last column. The parent rollup matrix is also loaded with city pointers 266, 268 to appropriate entry points 270, 272 of child rollup matrix 250. After the parent rollup matrix has been loaded (step 320), ranked parent candidates may then be rolled out (step 330) for processing by a dictionary. The dictionary required for use with the nested rollup matrices 240 shown in the example of FIG. 11 would be a city-state-zip dictionary for verifying specific city-state-zip combinations.

FIG. 13 is a flow diagram showing a sequence of steps for rolling out a parent candidate from the nested rollup matrices 240 of FIG. 11. With reference to FIG. 13, a nested roll-out routine 400 starts at an entry point, which is a terminal parent node of a linked list of the parent matrix (step 410). All subsequent steps shown in FIG. 13 are identical regardless of whether the current node is a terminal node or another node of nested rollup matrices 240. Nested roll-out routine 400 next determines whether the parent node includes a pointer to a nested child matrix (step 420). If not, then nested roll-out routine 400 reads the element stored in the current node (step 430) and prepends it to a parent candidate tail. Nested roll-out routine 400 then determines whether the node includes a return pointer that would indicate completion of the parent candidate (step 440). If not, then nested roll-out routine 400 advances to the next node in the linked list (step 450) and returns to step 420. If a parent node includes a nested matrix pointer to a nested rollup matrix (at step 420), then nested roll-out routine 400 proceeds to store in memory an address of the parent node that includes the nested matrix pointer (step 460). Nested roll-out routine 400 then rolls out a child candidate from the nested child matrix (step 470) and prepends the child candidate to the parent candidate tail (step 480). Nested roll-out routine 400 then restores the address of the last-read parent node, which was previously stored in memory, and returns to the parent rollup function (step 490), continuing on at the last-read parent node.

When a parent node includes a return pointer (step 440), nested roll-out routine 400 completes its assembly of the parent candidate and processes it using dictionary process 500. If the parent candidate passes the dictionary test, it is output. The nested roll-out function can be repeated for each terminal node of the parent rollup matrix to complete roll-out of all parent candidates.
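The nested roll-out of FIG. 13 may be sketched in Python as follows, as an illustration under stated assumptions rather than the disclosed implementation. The `Node` class, the helper `chain`, the `separator` parameter, and the sample city, state, and zip-code values are hypothetical. In this sketch a `next` value of `None` stands in for the return pointer, and the recursive call takes the place of storing and later restoring the address of the parent node (steps 460 and 490).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    element: str = ""
    next: Optional["Node"] = None         # None stands in for the return pointer
    child_entry: Optional["Node"] = None  # pointer to a nested child entry point


def chain(text):
    """Build a linked list for text; return its entry point (terminal node)."""
    node = None
    for character in text:
        node = Node(element=character, next=node)
    return node


def roll_out(entry, separator=""):
    """Roll out one candidate, descending into nested child matrices."""
    parts = []
    node = entry
    while node is not None:
        if node.child_entry is not None:
            # Roll out the child candidate, then resume at this parent node.
            parts.insert(0, roll_out(node.child_entry))
        else:
            parts.insert(0, node.element)  # prepend to the candidate tail
        node = node.next
    return separator.join(parts)


# Child matrix: a single candidate city name (illustrative value).
city_entry = chain("Reno")

# Parent list: zip code (terminal), then state, then a city pointer.
city_node = Node(child_entry=city_entry)
state_node = Node("NV", next=city_node)
zip_node = Node("89501", next=state_node)

parent_candidate = roll_out(zip_node, separator=" ")
print(parent_candidate)
```

The same routine handles deeper hierarchies, since a node of the child list may itself carry a pointer to a further nested entry point.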

It will be obvious to those having skill in the art that many changes may be made to the details of the above-described embodiments of this invention without departing from the underlying principles thereof. The scope of the present invention should, therefore, be determined only by the following claims.

1. A computer-implemented system for organizing a set of sibling entities each having one or more child possibilities, at least one of the sibling entities including multiple child possibilities having a relative rank or confidence value and from which multiple parent candidates can be generated, each of the parent candidates having a relative rank, and for generating an ordered series of parent candidates from the child possibilities, comprising: a means for initializing a plurality of nodes in a computer-readable data storage medium for storing the child possibilities of the sibling entities; a means for loading the sibling entities into the nodes to form a rollup matrix having an organization that represents the relative ranking of the parent candidates; and a means for reading from the nodes to generate a series of parent candidates in order of their ranking.
2. The system of claim 1, further comprising: a means for calculating a parent candidate confidence for at least some of the parent candidates; a means for storing the parent candidate confidences in the rollup matrix in association with the corresponding parent candidates; and in which the means for reading from the nodes generates the series of parent candidates based on the stored parent candidate confidences.
3. The system of claim 1, further comprising a means for comparing the generated parent candidates against a dictionary.
4. The system of claim 1 in which: at least one of the sibling entities includes a nested child matrix having an entry point; and the means for loading includes a means for loading the nested child matrix into one or more of the nodes, a means for creating a pointer to the entry point, and a means for storing the pointer in the rollup matrix.
5. A computer-implemented method for organizing a set of sibling entities each having one or more child possibilities, at least one of the sibling entities including multiple child possibilities having a relative rank or confidence value and from which multiple parent candidates can be generated, each of the parent candidates having a relative rank, and for generating an ordered series of parent candidates from the child possibilities, comprising: initializing a plurality of nodes in a computer-readable data storage medium for storing the child possibilities of the sibling entities; loading the sibling entities into the nodes to form a rollup matrix having an organization that represents the relative ranking of the parent candidates; and reading from the nodes to generate a series of parent candidates in order of their ranking; and outputting at least one of the parent candidates.
6. The method of claim 5, further comprising: calculating a parent candidate confidence for at least some of the parent candidates; storing the parent candidate confidences in the rollup matrix in association with the corresponding parent candidates; and reading from the nodes generates the series of parent candidates based on the stored parent candidate confidences.
7. The method of claim 5, further comprising comparing the generated parent candidates against a dictionary.
8. The method of claim 5 in which: at least one of the sibling entities includes a nested child matrix having an entry point; and the loading of the sibling entities into the nodes includes loading the nested child matrix into one or more of the nodes, creating a pointer to the entry point, and storing the pointer in the rollup matrix.
9. A method for character recognition in an OCR system, the method comprising: optically scanning a document to obtain data defining an image; segmenting the image to determine a plurality of sibling glyphs; each sibling glyph comprising an associated possibility set, the possibility set consisting of at least one alphanumeric character candidate information pair, each pair consisting of a respective candidate and an associated confidence value; identifying a plurality of parent candidates based on the sibling glyphs, each parent candidate representing a candidate word; calculating a parent candidate confidence value for at least some of the parent candidates; storing the parent candidate confidences in a rollup matrix in association with the corresponding parent candidates; and reading from the nodes so as to generate a series of parent candidate words based on the stored parent candidate confidence values.